
    Breast cancer prognosis by combinatorial analysis of gene expression data

    INTRODUCTION: The potential of applying data analysis tools to microarray data for diagnosis and prognosis is illustrated on the recent breast cancer dataset of van 't Veer and coworkers. We re-examine that dataset using the novel technique of logical analysis of data (LAD), with the twofold objective of discovering patterns characteristic of cases with good or poor outcomes and using them for accurate and justifiable predictions, and of deriving novel information about the role of genes, the existence of special classes of cases, and other factors. METHOD: Data were analyzed using the combinatorics- and optimization-based method of LAD, recently shown to provide highly accurate diagnostic and prognostic systems in cardiology, cancer proteomics, hematology, pulmonology, and other disciplines. RESULTS: LAD identified a subset of 17 of the 25,000 genes capable of fully distinguishing between patients with poor and good prognoses. An extensive list of 'patterns' or 'combinatorial biomarkers' (that is, combinations of genes and limitations on their expression levels) was generated, and 40 patterns were used to create a prognostic system, shown to have 100% and 92.9% weighted accuracy on the training and test sets, respectively. The prognostic system uses fewer genes than other methods and has similar or better accuracy than those reported in other studies. Of the 17 genes identified by LAD, three (respectively, five) were shown to play a significant role in determining poor (respectively, good) prognosis. Two new classes of patients (described by similar sets of covering patterns, gene expression ranges, and clinical features) were discovered. As a by-product of the study, it is shown that the training and test sets of van 't Veer have differing characteristics. CONCLUSION: The study shows that LAD provides an accurate and fully explanatory prognostic system for breast cancer using genomic data, that is, a system that, in addition to predicting good or poor prognosis, provides an individualized explanation of the reasons for that prognosis for each patient. Moreover, the LAD model provides valuable insights into the roles of individual and combinatorial biomarkers, allows the discovery of new classes of patients, and generates a vast library of biomedical research hypotheses.
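
    The 'combinatorial biomarkers' mentioned above are conjunctions of bounds on gene expression levels, and the prognosis is produced by a weighted vote over the patterns a patient satisfies. The following minimal Python sketch illustrates that idea; the gene names, cut-points, and weights are hypothetical placeholders, not values from the study.

        # LAD-style patterns: each one is a conjunction of interval conditions on
        # gene expression levels, paired with a weight used in the prognostic vote.
        # All gene names, bounds, and weights here are made-up placeholders.
        POOR_PATTERNS = [
            ([("GENE_A", 0.42, None), ("GENE_B", None, -0.31)], 1.0),
        ]
        GOOD_PATTERNS = [
            ([("GENE_C", None, 0.10), ("GENE_D", -0.25, None)], 1.0),
        ]

        def satisfies(sample, conditions):
            """True if every (gene, lower, upper) bound in a pattern holds for the sample."""
            for gene, lo, hi in conditions:
                x = sample[gene]
                if (lo is not None and x < lo) or (hi is not None and x > hi):
                    return False
            return True

        def prognosis(sample):
            """Weighted vote of the poor- and good-outcome patterns covering the sample."""
            poor = sum(w for conds, w in POOR_PATTERNS if satisfies(sample, conds))
            good = sum(w for conds, w in GOOD_PATTERNS if satisfies(sample, conds))
            return "poor" if poor > good else "good"

        sample = {"GENE_A": 0.5, "GENE_B": -0.4, "GENE_C": 0.3, "GENE_D": 0.0}
        print(prognosis(sample))  # the covering patterns themselves explain the call

    Because every prediction is determined by the explicit patterns a patient satisfies, each call comes with an individualized, human-readable justification, which is the 'fully explanatory' property claimed in the conclusion.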

    PATTERN-BASED FEATURE SELECTION IN GENOMICS AND PROTEOMICS

    A major difficulty in data analysis is due to the size of the datasets, which frequently contain large numbers of irrelevant or redundant variables. In particular, in some of the most rapidly developing areas of bioinformatics, e.g., genomics and proteomics, the expression or intensity levels of tens of thousands of genes or proteins are reported for each observation, in spite of the fact that very small subsets of these features are sufficient for distinguishing positive observations from negative ones. In this study, we describe a two-step procedure for feature selection. In the first “filtering” stage, a relatively small subset of relevant features is identified on the basis of several combinatorial, statistical, and information-theoretical criteria. In the second stage, the importance of the variables selected in the first step is evaluated based on the frequency of their participation in the set of all maximal patterns (defined as in the Logical Analysis of Data, and generated using an efficient, total-polynomial-time algorithm), and low-impact variables are eliminated. This step is applied iteratively, until arriving at a Pareto-optimal “support set”, which balances the conflicting criteria of simplicity and accuracy.
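
    As a rough sketch of the two-stage procedure described above (in Python, with placeholder scoring and pattern-frequency functions standing in for the combinatorial, statistical, and information-theoretical criteria and for the maximal-pattern generator of the study):

        def filter_stage(features, relevance, keep):
            """Stage 1: keep the `keep` highest-scoring features under some relevance criterion."""
            return sorted(features, key=relevance, reverse=True)[:keep]

        def iterative_elimination(features, pattern_frequency, min_share=0.01):
            """Stage 2: repeatedly drop features that rarely participate in maximal patterns."""
            while True:
                freq = pattern_frequency(features)  # feature -> share of maximal patterns using it
                low_impact = [f for f in features if freq.get(f, 0.0) < min_share]
                if not low_impact:
                    return features                 # remaining features form the "support set"
                features = [f for f in features if f not in low_impact]

        # toy usage: dummy relevance and frequency functions replace the real criteria
        genes = [f"g{i}" for i in range(1000)]
        kept = filter_stage(genes, relevance=lambda g: int(g[1:]) % 50, keep=40)
        support = iterative_elimination(kept, pattern_frequency=lambda fs: {f: 1.0 / len(fs) for f in fs})
        print(len(support))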

    Pattern-based clustering and attribute analysis

    The Logical Analysis of Data (LAD) is a combinatorics-, optimization-, and logic-based methodology for the analysis of datasets with binary or numerical input variables and binary outcomes. It has been established in previous studies that LAD provides a competitive classification tool, comparable in efficiency with the top classification techniques available. The goal of this paper is to show that the methodology of LAD can be useful in the discovery of new classes of observations and in the analysis of attributes. After a brief description of the main concepts of LAD, two efficient combinatorial algorithms are described for the generation of all prime, respectively all spanned, patterns (rules) satisfying certain conditions. It is shown that the application of classic clustering techniques to the set of observations represented in prime pattern space leads to the identification of a subclass of, say, positive observations which is accurately recognizable and sharply distinct from the observations in the opposite, negative, class. It is also shown that the set of all spanned patterns allows the introduction of a measure of significance and of a concept of monotonicity in the set of attributes.
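
    One way to picture the clustering step is to re-encode every observation by the set of patterns that cover it and then apply a classic clustering method to that binary representation. A minimal sketch, with hypothetical threshold patterns and assuming scikit-learn is available for the clustering step:

        import numpy as np
        from sklearn.cluster import KMeans

        def pattern_space(observations, patterns):
            """Binary matrix: entry (i, j) is 1 if pattern j covers observation i."""
            return np.array([[1 if p(obs) else 0 for p in patterns]
                             for obs in observations])

        # toy numeric observations and pattern-like rules (conjunctions of bounds)
        observations = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
        patterns = [
            lambda o: o[0] > 0.5 and o[1] < 0.5,
            lambda o: o[0] < 0.5 and o[1] > 0.5,
            lambda o: o[0] + o[1] > 0.9,
        ]

        X = pattern_space(observations, patterns)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
        print(labels)  # observations covered by similar sets of patterns cluster together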

    Comprehensive vs. comprehensible classifiers in logical analysis of data

    The main objective of this paper is to compare the classification accuracy provided by large, comprehensive collections of patterns (rules) derived from archives of past observations with that provided by small, comprehensible collections of patterns. This comparison is carried out here on the basis of an empirical study, using several publicly available data sets. The results of this study show that the use of comprehensive collections allows a slight increase in classification accuracy, and that the “cost of comprehensibility” is small.

    New Tools for Spatial Intelligence Education: the X-Colony Knowledge Discovery Kit

    This study introduces a new framework for developing spatial education programs, based on a geometric language and the manipulation of ensembles of polyhedra, called the X-Colony Knowledge Discovery Kit (KDK). The main goals of the KDK are to develop spatial intelligence, creativity, strategic planning, forecasting skills, abstract reasoning, self-confidence, and social skills. Landmark studies document that spatial education plays a central role in driving performance in science, technology, engineering, and mathematics (STEM) occupations, yet spatial education is under-studied and the infrastructure for research on spatial learning is still in its early stages. The KDK introduces a novel geometric language that allows visual communication and develops spatial abilities by engaging students in creative paper folding and various mental spatial transformations. The KDK is organized into program sessions consisting of cooperative, open-ended paper-construction activities that engage students to build modular constructions of gradual complexity and to explore various strategies for combining the constructs into novel configurations. The KDK supports the Core Math Standard and Science curricula and provides students the opportunity to discover connections between mathematics, science, and various other disciplines. A pilot case-control study conducted with fifth-grade students indicates an average increase of 17% in geometric reasoning after 8 hours of KDK activities.

    Consensus Algorithms for the Generation of All Maximal Bicliques

    We describe a consensus-type algorithm for determining all the maximal complete bipartite (not necessarily induced) subgraphs of a graph. We show that by imposing a particular order in which the consensus-type operations are executed, this algorithm becomes totally polynomial. By imposing a further restriction on the way the algorithm is executed, we derive an improved variant whose complexity is bounded by a polynomial that is cubic in the input size and only linear in the output size, and we show its high efficiency in numerous computational experiments on randomly generated graphs.
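
    To give a feel for a consensus-type step on bicliques (a rough illustration, not necessarily the exact operations or execution order of the paper): from two bicliques (X1, Y1) and (X2, Y2), taking the union on one side and the intersection on the other yields another complete bipartite (not necessarily induced) subgraph whenever the two sides are non-empty and disjoint, and the candidate can then be extended greedily to a maximal biclique.

        def is_biclique(adj, X, Y):
            """Every vertex of X is adjacent to every vertex of Y."""
            return all(y in adj[x] for x in X for y in Y)

        def consensus(X1, Y1, X2, Y2):
            """Consensus candidate: union on one side, intersection on the other."""
            X, Y = X1 | X2, Y1 & Y2
            return (X, Y) if X and Y and not (X & Y) else None

        def extend_to_maximal(adj, X, Y):
            """Greedily add vertices adjacent to the whole opposite side."""
            X, Y = set(X), set(Y)
            changed = True
            while changed:
                changed = False
                for v in adj:
                    if v not in X and v not in Y and all(u in adj[v] for u in Y):
                        X.add(v)
                        changed = True
                    elif v not in Y and v not in X and all(u in adj[v] for u in X):
                        Y.add(v)
                        changed = True
            return X, Y

        # toy graph given as an adjacency map
        adj = {1: {3, 4}, 2: {3, 4}, 3: {1, 2}, 4: {1, 2, 5}, 5: {4}}
        cand = consensus({1}, {3, 4}, {2}, {3, 4})
        if cand and is_biclique(adj, *cand):
            print(extend_to_maximal(adj, *cand))  # e.g. ({1, 2}, {3, 4})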